Understand the criterion problem: why measuring performance is harder than it seems. Learn about ultimate, conceptual, and actual criteria, composite performance measures, and their limitations.
"The criterion problem is fundamentally a problem of values. What we choose to measure and how we choose to measure it reflects what we believe is important—and those beliefs may not align with what actually matters for organizational success." — James T. Austin & Peter Villanova, Journal of Applied Psychology (1992)
What if the performance metrics you trust are missing critical dimensions of success—or measuring things that have nothing to do with actual job performance? One of the most persistent and underappreciated challenges in organizational psychology is what researchers call the criterion problem: the fundamental difficulty of identifying, defining, and measuring job performance.
The gap between the ultimate criterion (what we theoretically want to measure), the conceptual criterion (how we define it), and the actual criterion (what we actually measure) creates systematic errors that distort organizational decisions.
Understanding this problem—and the empirical evidence about what works—is essential for building fair, effective performance management systems.
The criterion problem, identified decades ago (Flanagan, 1956) and comprehensively reviewed by Austin and Villanova (1992), asks a deceptively simple question: how do we know whether someone is performing well at their job?
The answer is far more complex than it appears.
Ultimate Criterion: What We Want to Measure
The ultimate criterion is the theoretical construct—what we ideally want to assess. For a salesperson, the ultimate criterion might be "success in generating revenue while maintaining customer relationships and contributing to team effectiveness." The ultimate criterion is rarely, if ever, directly measurable. It's abstract, multidimensional, and context-dependent.
Conceptual Criterion: How We Define It
The conceptual criterion is how we operationalize the ultimate criterion in specific contexts. For the salesperson, we might define it as: "Individual performance on selling tasks + willingness to help colleagues + adaptability to changing market conditions." The conceptual criterion is still primarily a theoretical construct, but it's more specific and measurable.
Actual Criterion: What We Actually Measure
The actual criterion is what we measure in practice: sales figures, supervisor ratings, customer satisfaction scores, or attendance records. This is where systematic problems emerge. The actual criterion rarely, if ever, perfectly captures the conceptual criterion. The gap between conceptual and actual criteria creates two distinct problems: criterion deficiency and criterion contamination.
Criterion deficiency occurs when the actual measure fails to capture important components of the conceptual criterion. Important aspects of performance are simply omitted.
Classic Example: Imagine you're hiring an administrative assistant and you administer a work sample test involving organizing files, scheduling meetings, and managing correspondence. The test seems comprehensive and objective. However, the work sample doesn't assess the applicant's ability to type on a computer because the test uses paper-based filing systems. Yet in the actual job, computer typing is essential. Your measure is deficient—it misses a critical performance dimension.
Research Evidence on Deficiency: A meta-analysis comparing supervisor ratings with objective performance measures found only moderate convergence between the two (corrected correlation of .39), indicating substantial deficiency in either approach alone. More striking: in the three studies where objective and subjective measures tapped precisely the same performance dimension, the mean corrected correlation jumped to .71. Deficiency emerges when measures tap overlapping rather than identical dimensions.
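A note on what "corrected" means here: meta-analyses typically apply Spearman's classic correction for attenuation, which estimates the correlation between the underlying constructs after stripping out measurement error in both variables. Here is a minimal sketch, using illustrative reliability values that are assumptions rather than figures from the studies above:

```python
import math

def correct_for_attenuation(r_observed: float, rel_x: float, rel_y: float) -> float:
    """Spearman's classic correction: estimates the correlation between two
    constructs after removing measurement error from both measures."""
    return r_observed / math.sqrt(rel_x * rel_y)

# Illustrative values only (not taken from the meta-analysis): an observed
# correlation of .29, supervisor-rating reliability of .52, and
# objective-measure reliability of .80 correct to roughly .45.
print(round(correct_for_attenuation(0.29, rel_x=0.52, rel_y=0.80), 2))  # 0.45
```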
The Multidimensional Nature of Performance
Contemporary research increasingly recognizes that job performance is multidimensional, not unidimensional. A comprehensive framework includes:
Task Performance: Direct technical performance on core job tasks
Contextual Performance: Behaviors that support organizational culture and team effectiveness
Adaptive Performance: Ability to adjust to changing conditions, learn new skills, handle emergencies
Counterproductive Work Behaviors: Actions that harm organizational effectiveness
Critical Empirical Finding: Research by Allworth and Hesketh (1999) demonstrated that adaptive performance had different predictors than task and contextual performance, confirming that these dimensions are distinct and require separate measurement. If your performance system measures only task performance (sales figures), it is deficient with respect to contextual performance (whether the salesperson trains new colleagues), adaptive performance (whether they adjust to new sales technologies), and counterproductive behaviors (whether they cut corners or undermine colleagues).
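One practical implication: keep the dimensions separate in the data model rather than collapsing them at collection time. A minimal sketch (the class and field names are illustrative, not a standard):

```python
from dataclasses import dataclass

@dataclass
class PerformanceRecord:
    """Stores the four dimensions separately; each may have different
    predictors, raters, and uses, so premature averaging loses information."""
    employee_id: str
    task: float               # e.g., sales figures, error rates
    contextual: float         # e.g., peer ratings of helping behavior
    adaptive: float           # e.g., ratings after a technology change
    counterproductive: float  # e.g., verified policy violations

record = PerformanceRecord("E-1042", task=0.8, contextual=0.6,
                           adaptive=0.4, counterproductive=0.1)
print(record)  # report dimensions side by side; weight explicitly later
```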
Criterion contamination occurs when the actual measure includes factors unrelated to true job performance. The measure picks up variance we don't want, and this noise distorts the assessment of true performance.
Classic Example (the Halo Effect): Imagine a manager who rates an employee highly on "meeting deadlines" because the employee is personally likable and physically attractive, even though the employee frequently misses deadlines. The rating is contaminated by factors unrelated to deadline performance.
Research Evidence on Contamination: A meta-analysis by Podsakoff et al. (2013) found that supervisor ratings are influenced by factors such as ethnicity, gender, and the quality of the leader-member relationship, even when these factors are unrelated to actual job performance. Similarly, research on medical residents found that supervisor ratings were contaminated by demographic similarity—supervisors rated demographically similar residents higher even when objective measures of clinical performance showed no differences.
An insidious form of contamination emerges from scale differences. A comprehensive study of hospital composite performance measures (combining process metrics and outcome metrics) found that:
Hospitals' composite scores were almost entirely driven by process measures (explained variation > 99%)
Patient survival outcomes, despite being theoretically critical, contributed negligible information (explained variation = 4%)
This wasn't because there were more process measures. When researchers increased the weight of outcomes 5-fold, survival still contributed less than any individual process item. Why? Because process measures were on scales with much larger standard deviations, making variation in process measures dominate the composite regardless of their theoretical importance.
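The mechanism is easy to reproduce. Below is a small simulation (invented numbers, not the study's data) that builds a composite from a high-variance process score and a low-variance survival score; the variable names and scales are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_hospitals = 500

# Illustrative scales: process adherence varies widely across hospitals,
# while survival is clustered in a narrow band.
process = rng.normal(loc=85.0, scale=10.0, size=n_hospitals)   # SD ~ 10
survival = rng.normal(loc=98.0, scale=0.5, size=n_hospitals)   # SD ~ 0.5

def explained_variation(component, composite):
    """Share of composite variance explained by one component (squared r)."""
    return np.corrcoef(component, composite)[0, 1] ** 2

# Equal nominal weights: the high-SD process measure dominates anyway.
composite = 0.5 * process + 0.5 * survival
print(f"process:  {explained_variation(process, composite):.3f}")    # ~0.998
print(f"survival: {explained_variation(survival, composite):.3f}")   # ~0.002

# Even a 5x weight on survival barely helps, because weights multiply
# raw scores, not standardized ones.
composite_5x = 0.5 * process + 2.5 * survival
print(f"survival at 5x weight: {explained_variation(survival, composite_5x):.3f}")  # ~0.06

# The usual fix: z-score each measure before weighting.
def z(x):
    return (x - x.mean()) / x.std()

balanced = 0.5 * z(process) + 0.5 * z(survival)
print(f"survival after z-scoring: {explained_variation(survival, balanced):.3f}")   # ~0.5
```

Standardizing each measure first makes the nominal weights actually govern each dimension's influence on the composite, which is presumably what a composite's designers intend.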
Organizations typically choose between objective (hard) criteria derived from organizational records and subjective criteria relying on human judgment. Each approach has distinct advantages and limitations.
Objective criteria derive from organizational records and appear to involve minimal subjective judgment: sales volume, productivity counts, absence rates, error rates.
Advantages: Appear objective and defensible; Easy to collect from existing systems; Reduce rater bias concerns; Directly relevant to organizational outputs.
Limitations: (1) Context Beyond Individual Control: A meta-analysis by Bommer et al. (1995) found that salespeople's objective performance is heavily influenced by economic downturns, consumer preferences, and territory assignment—factors completely outside individual control. (2) Incomplete Performance Capture: A salesperson might have excellent sales figures but accomplish this through aggressive, deceptive practices that harm customer relationships. (3) Goodhart's Law: When a measure becomes a target, it ceases to be a good measure.
Subjective measures rely on human judgment through supervisor ratings, peer evaluations, or self-assessments.
Advantages: Capture nuanced dimensions difficult to quantify (teamwork, creativity, adaptability); Can consider context and special circumstances; Flexible across different roles.
Limitations: Rater bias; Demographic similarity bias; Relationship quality contamination. Empirical Evidence: When objective and subjective measures target the same performance dimension, they show moderate-to-strong convergence (r = .39 to .71). However, when they tap different performance dimensions, correlations often drop to .27-.39. This suggests that objective and subjective measures capture somewhat different aspects of performance and that neither alone provides a complete picture.
1. Use Multiple Criteria from Different Sources: Rather than selecting a single "best" criterion, use multiple measures capturing different performance dimensions from different sources. Meta-analytic evidence suggests that reliability and validity are highest when multiple measurement approaches are combined rather than relying on any single method.
2. Conduct Thorough Job Analysis: Before choosing criteria, conduct comprehensive job analysis identifying all performance dimensions necessary for success, relative importance of each dimension, how context affects performance, and what factors beyond individual control influence each dimension.
3. Make Weighting Decisions Explicit: For composite criteria, make weighting decisions explicit and intentional. Ground weights in organizational strategy, conduct sensitivity analysis to check whether rankings shift under alternative weights, and communicate to stakeholders why dimensions were weighted differently (a code sketch follows this list).
4. Separate Task and Context Dimensions: Research by Borman and Motowidlo (1993) showed that task performance and contextual performance are distinct constructs with different predictors. Measure them separately rather than combining into a single score.
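To make recommendation 3 concrete, here is a minimal sensitivity-analysis sketch in Python; the employees, dimension scores, and candidate weights are all hypothetical:

```python
import numpy as np

# Hypothetical standardized scores for five employees on three dimensions
# (columns: task, contextual, adaptive performance).
scores = np.array([
    [ 1.2, -0.5,  0.3],
    [ 0.4,  1.1, -0.2],
    [-0.3,  0.8,  1.0],
    [ 0.9,  0.2, -1.1],
    [-0.6, -0.4,  0.7],
])

def ranking(weights):
    """Rank employees (best first, by row index) under a weight vector."""
    composite = scores @ np.asarray(weights)
    return list(np.argsort(-composite))

# Compare rankings under several defensible weighting schemes.
for w in [(0.6, 0.2, 0.2), (0.4, 0.4, 0.2), (0.34, 0.33, 0.33)]:
    print(w, "->", ranking(w))
```

If rankings stay stable across plausible weights, the composite is robust; if they flip, the weighting decision materially drives outcomes and deserves explicit justification to stakeholders.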
The criterion problem has no perfect solution. The gap between ultimate criterion (what we theoretically want to measure), conceptual criterion (how we define it), and actual criterion (what we measure) is inevitable.
However, organizations can acknowledge this gap and manage it strategically through: recognizing that no single measure captures all performance dimensions; using multiple criteria from different sources; making weighting decisions explicit and intentional; conducting regular sensitivity analysis; and accepting that performance measurement is an approximation, not truth. The most effective performance management systems aren't those claiming to measure "true performance." They're systems that acknowledge measurement limitations while making principled choices about what aspects of performance matter most in specific contexts.
Organization Learning Labs offers comprehensive job analysis services and performance measurement system design to help organizations navigate the criterion problem strategically, reducing bias while capturing the multidimensional nature of job performance. Contact us at research@organizationlearninglabs.com.
Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917-1992. Journal of Applied Psychology, 77(6), 836-874.
Bommer, W. H., Johnson, J. L., Rich, G. A., Podsakoff, P. M., & MacKenzie, S. B. (1995). On the interchangeability of objective and subjective measures of employee performance: A meta-analysis. Personnel Psychology, 48(3), 587-605.
Borman, W. C., & Motowidlo, S. J. (1993). Expanding the criterion domain to include elements of contextual performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 71-98). Jossey-Bass.
O'Brien, S. M., et al. (2007). Exploring the behavior of hospital composite performance scores. Circulation, 116(25), 2917-2924.